BAYES NETS IN EDUCATIONAL ASSESSMENT:
WHERE DO THE NUMBERS COME FROM? 1 



Robert J. Mislevy, Russell G. Almond, 
Duanli Yan, and Linda S. Steinberg, 

CRESST/Educational Testing Service



Abstract 

Educational assessments that exploit advances in technology and cognitive psychology 
can produce observations and pose student models that outstrip familiar test-theoretic 
models and analytic methods. Bayesian inference networks (BINs), which include 
familiar models and techniques as special cases, can be used to manage belief about
students' knowledge and skills, in light of what they say and do. BINs for assessments that
add new tasks to their item pools and measure different students with different items can
be assembled from building-block fragments. A student-model BIN (SM-BIN) fragment
contains student model variables, which characterize aspects of knowledge. Evidence 
model BIN fragments (EM-BINs) contain observable variables and pointers to student 
model variables. Joining EM-BIN fragments to an SM-BIN fragment permits one to 
update belief about a student as observations arrive in a setting the EM-BIN was 
constructed to handle. Markov Chain Monte Carlo (MCMC) techniques can be used to 
estimate the conditional probabilities in the BINs from empirical data, supplemented by 
expert judgment or substantive theory. Details for the special cases of item response 
theory (IRT) and multivariate latent class modeling are given, with a numerical example 
of the latter. 



1. Overview 

This paper concerns statistical methods for managing uncertainty about
students' knowledge, as evidenced by their performances in assessment tasks. Section
2 sketches a framework for assessment design that includes the building blocks of 
the statistical model. They are student model Bayesian inference network (SM-BIN) 
fragments, which contain unobservable variables that characterize aspects of 
students' knowledge or skills, and evidence model fragments (EM-BINs), which
contain observable variables and pointers to student-model variables. The BIN
fragments can be joined for updating belief about students' proficiencies as evidence
arrives, an example of "knowledge based model construction" (KBMC; Breese,
Goldman, & Wellman, 1994).

1 We thank Eddie Herskovitz and Andrew Gelman for their contributions to this work, and
Kikumi Tatsuoka for permission to use her data on mixed number subtraction. We gratefully
acknowledge our intellectual debt to Dr. Tatsuoka, having leaned on the insights in her
classroom observations, cognitive analysis, test design, and analyses.

Section 3 addresses the perennial question in expert systems, "Where do the 
numbers come from?" We describe a general probability model and a Bayesian 
approach to estimating the parameters of student and evidence models, calibrating 
new tasks into an existing assessment, and drawing inferences about students. 
Section 4 illustrates the ideas for computerized adaptive testing (CAT) with item 
response theory (IRT) models. Section 5 lays out a second special case, namely, a 
multivariate latent class model, and gives a numerical example. 

2. The Assessment Framework 

The essential problem in assessment is drawing inferences about what a 
student knows or can do, from limited observations of what she actually says or 
does in a relatively small number of particular settings. The present paper arises 
from a research program studying educational assessment from the perspective of 
evidentiary reasoning (Schum, 1994), the "Portal" project. The focus here is on 
statistical methods. Other presentations focus on cognitive psychology (Mislevy,
1995; Steinberg & Gitomer, 1996), probability-based reasoning (Almond et al., 1999;
Mislevy & Gitomer, 1996), assessment design (Almond & Mislevy, 1999; Mislevy,
Steinberg, & Almond, in press), and computer-based simulation (Mislevy et al.,
1999; Steinberg & Gitomer, 1996).

A quote from Messick (1992) captures the spirit of our approach to assessment 
design: 

A construct-centered approach would begin by asking what complex of knowledge, 
skills, or other attribute should be assessed. . . . Next, what behaviors or performances 
should reveal those constructs, and what tasks or situations should elicit those 
behaviors? Thus, the nature of the construct guides the selection or construction of 
relevant tasks as well as the rational development of construct-based scoring criteria and 
rubrics. (p. 17)

Our work has two facets: a conceptual framework for assessment, and processes for 
developing and implementing specific applications built according to the 
framework. Figure 1 is a schematic representation of the four high-level objects in a
Portal conceptual assessment framework where the issues of statistical inference
arise.

• The Student Model contains unobservable variables, denoted $\theta_i = (\theta_{i1}, \ldots, \theta_{iK})$
for Examinee i, which characterize the aspects of knowledge and skill that
are the targets of inference in the assessment. The SM-BIN manages our
belief about $\theta_i$ in terms of a probability distribution. The student model
variables for all N examinees in a sample of interest are denoted $\theta$.

• An Evidence Model first describes how to extract the salient bits of evidence
from what a student says or does in the context of a task (the work
product). Evidence rules produce the values of observable variables,
denoted $X_j = (X_{j1}, \ldots, X_{jM})$ for Task j. An evidence model also describes, in
terms of the structure of an EM-BIN, how each $X_j$ depends on $\theta$. The
complete collection of responses across all examinees and all tasks is
denoted $X$.

• A Task Model describes the features of a task that need to be specified. This
includes specifications for the work environment, tools the examinee may
use, the work products, stimulus materials, and interactions between the
examinee and the task, as consistent with the evidentiary requirements of a
conformable evidence model. The characteristics of a task are expressed by
task model variables, $Y_j = (Y_{j1}, \ldots, Y_{jL})$ for Task j; they are determined by the
test developer, and are known with certainty. The complete collection of
task features for all tasks in the item pool is denoted $Y$.




[Figure 1. High-level assessment design objects. Surviving panel label: Assessment
Assembly Specifications.]



• The Assembly Model describes the mixture of tasks that go into an 
operational assessment, either the specification of a fixed test form or a 
procedure for determining tasks dynamically. 

3. The Probability Framework 

According to Gelman et al. (1995, p. 3), the first step in Bayesian analysis is 
setting up a full probability model — a joint probability distribution for all observable 
and unobservable quantities in a problem. "The model," they continue, "should be
consistent with knowledge about the underlying scientific problem and the data 
collection process." In assessment, what we know about the domain identifies the 
nature of the targeted knowledge and skill, the ways in which aspects of that 
knowledge are evidenced in performance, and the features of situations that provide 
an opportunity to observe those behaviors. We incorporate this information in a 
student model and a series of evidence models. The key conditional independence 
assumption posits that the aspects of proficiency expressed in the student model 
account for the associations among responses to different tasks (although we may 
allow for conditional dependence among multiple responses within the same task). 

3.1. The Probability Model 

The pertinent variables in assessment include tasks' $Y$s, all of which
are observable; examinees' $\theta$s, which are not; and $X$s, which are potentially
observable. Structures and parameters that reflect interrelationships among these 
variables, consistent with our knowledge about them, are also needed. We may 
build the required structures from SM-BINs and EM-BINs. This section describes 
them in general terms, while Sections 4 and 5 work through special cases from item 
response theory and latent class modeling. 

The SM-BIN for Examinee i is a probability distribution for $\theta_i$. An assumption
of exchangeability posits a common prior distribution for all examinees before any
responses are observed, with beliefs about expected levels and associations among
components expressed through the structure of the model and higher level
parameters $\lambda$; whence, for all Examinees i,

$$\theta_i \sim p(\theta \mid \lambda). \tag{1}$$

Depending on the strength with which theory and experience inform population-level
beliefs, $p(\lambda)$ could range from vague to precise.

As noted above, the evidence model for a class of tasks contains (1) rules for
ascertaining the values of observable variables $X$ from a student's work product,
and (2) the structure of a probability model for $X$ given $\theta$. We focus on the latter.
Evidence models, indexed by $s$, each support a class of tasks that provide values
for a similar set of observable variables $X_{(s)}$; further, the dependence structure of
these $X_{(s)}$s on $\theta$ is the same for all tasks $j$ using the same evidence model. Thus the
EM-BINs for tasks sharing the same evidence model will have the same graphical
structure and exchangeable parameters (probability tables), but the conditional
probability distributions within that structure can differ. As Sections 4 and 5
illustrate, this structure is guided by theory about proficiency in the domain and
careful task construction that evokes targeted aspects of that proficiency.

Let $\pi_{(s)j}$ denote the parameters of the EM-BIN distributions for Task j, which
uses the structure of evidence model $s(j)$ (or more simply, $s$). The distribution of the
responses of Examinee i to Task j is

$$X_{(s)ij} \sim p(x_{(s)} \mid \theta_i, \pi_{(s)j}). \tag{2}$$

All the tasks using an Evidence Model $s$ produce observables $X_{(s)}$ of the same
form, contributing information about the same components of $\theta$. But within this
common evidentiary structure, features of the tasks, encoded as $Y$s, can moderate
these relationships. For example, unfamiliar vocabulary and complex sentence
structures tend to make reading comprehension tasks more difficult. The parameters
$\pi$ for particular tasks may thus be modeled as exchangeable within evidence
models given the values of designated task model variables $Y_{(s)}$; that is,

$$\pi_{(s)j} \sim p(\pi_{(s)} \mid Y_{(s)j}, \eta_{(s)}), \tag{3}$$

with prior beliefs expressed through higher-level distributions $p(\eta_{(s)})$. We assume
that $X_{(s)ij}$ does not depend on $Y_{(s)j}$ other than possibly through $\pi_{(s)j}$. The complete
collection of probabilities for all EM-BINs for all tasks is denoted $\pi$, and the complete
collection of prior parameters for those probabilities is denoted $\eta$.

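The full probability model for the responses $X_{(s)ij}$ of N examinees to J tasks
nested within S evidence models can now be written as

$$p(X, \theta, \pi, \eta, \lambda \mid Y) = \Big[\prod_s \prod_j \prod_i p(x_{(s)ij} \mid \theta_i, \pi_{(s)j})\Big] \Big[\prod_s \prod_j p(\pi_{(s)j} \mid Y_{(s)j}, \eta_{(s)})\Big] \Big[\prod_s p(\eta_{(s)})\Big] \Big[\prod_i p(\theta_i \mid \lambda)\Big]\, p(\lambda). \tag{4}$$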

Figure 2 is a generalized form of an acyclic directed graph ("DAG")
representation of this model, with boxes representing replicated elements
(Spiegelhalter et al., 1996). The structure and the nature of the distributions are
tailored to the particulars of an application. In the sequel, we will omit the evidence
model subscripts $(s)$ from $X_j$, $Y_j$, and $\pi_j$ when they are not needed.



[Figure 2. DAG representation of the Probability Model, with boxes marking replication
over examinees (i), evidence models (s), and tasks (j). $X_{ij}$ is the response of
Student i to Task j; $\theta_i$ is the parameter of Examinee i; $\lambda$ is the parameter of the
distribution of $\theta$s; $\pi_{(s)j}$ is the parameter for Task j, which uses Evidence Model s;
$Y_j$ are the task model variables for Task j; and $\eta_{(s)}$ is the parameter of the
distribution of $\pi_{(s)j}$s. All of these parameters can be vector-valued.]

3.2 Statistical Inference 

In general, the second step of Bayesian inference involves conditioning on
observed data. Continuing from the preceding section, this would mean
conditioning on whatever observations $X$ are made (say $X_{old}$), to yield a posterior
distribution for the unobservable parameters $\theta$, $\pi$, $\eta$, and $\lambda$, and predictive
distributions for $X$s not yet observed (say $X_{new}$); i.e., $p(X_{new}, \theta, \pi, \eta, \lambda \mid X_{old}, Y)$.
Parameters and unobserved responses that are not of immediate interest can be
integrated out of this joint posterior to provide marginal posterior distributions for
specified variables as desired.

What are the jobs in an ongoing operational assessment? Primarily, we want to
learn about the $\theta$s of individual examinees, for such purposes as making selection
decisions, planning instruction, providing feedback on learning, informing
policymakers, and guiding students' work in a coached practice system. Usually we
can observe a student's responses to only a limited number of tasks. On the other
hand, we can often observe responses to a given task from a large number of
examinees. From these observations we refine our knowledge about how responses to a
given task depend on $\theta$; that is, the $\pi$s. This knowledge is used to select tasks to
administer to examinees, update our beliefs about their $\theta$s, and estimate the
conditional probability distributions (the $\pi$s) of new items.

3.2.1 Inference About Examinees 

Consider inference about Examinee i when $\pi$, $\eta$, and $\lambda$ are known to take the
values $\pi^*$, $\eta^*$, and $\lambda^*$, respectively. This situation may be approximated in an
ongoing program with considerable data about these parameters (Sections 3.2.2 and
3.2.3). Suppose we observe Examinee i's responses to tasks 1 through J. The
objective is to proceed from the prior distribution $p(\theta_i \mid \lambda^*)$ to the posterior
$p(\theta_i \mid \lambda^*, x_{i1}, \ldots, x_{iJ}, \pi_1^*, \ldots, \pi_J^*)$.

The SM-BIN for Examinee i is a probability distribution for $\theta_i$. Its initial status
is $p(\theta_i \mid \lambda^*)$. Following (2), the EM-BIN for Task 1 is $p(X_1 \mid \theta_i, \pi_1^*)$. Together they imply
the joint distribution of $\theta_i$ and $X_1$, namely $p(X_1, \theta_i \mid \lambda^*, \pi_1^*) = p(X_1 \mid \theta_i, \pi_1^*)\, p(\theta_i \mid \lambda^*)$. Once
$x_{i1}$ is observed, Bayes Theorem yields an updated distribution for $\theta_i$: $p(\theta_i \mid \lambda^*, x_{i1}, \pi_1^*)$.
To it we can attach the EM-BIN for Task 2, or $p(X_2 \mid \theta_i, \pi_2^*)$, and use Bayes Theorem
again to obtain $p(\theta_i \mid \lambda^*, x_{i1}, x_{i2}, \pi_1^*, \pi_2^*)$ once $x_{i2}$ is observed. So on through Task J.

Note that the capability to dock evidence-model BIN fragments with the student- 
model BIN fragment, absorb evidence from it, then discard it in preparation for the 
next task is made possible by the conditional independence structure across 
observations from different tasks — a structure generally achieved only through 
careful study of proficiency in the domain and principled task construction in its 
light. 
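The docking cycle can be illustrated with a minimal Python sketch for a fully discrete case. This is an illustration under stated assumptions, not the implementation used in the Portal project: the function name `absorb_evidence`, the two-state student model, and all probability values are hypothetical.

```python
import numpy as np

def absorb_evidence(belief, em_bin, x):
    """One docking step.

    belief[k]    = current P(theta = k) held in the SM-BIN
    em_bin[k, v] = P(X = v | theta = k), the EM-BIN's conditional table
    x            = the observed response value
    """
    joint = belief * em_bin[:, x]   # p(x | theta) p(theta) at the observed x
    return joint / joint.sum()      # Bayes theorem: renormalize over theta

# Hypothetical student model with two states: 0 = non-master, 1 = master.
prior = np.array([0.5, 0.5])

# Hypothetical EM-BINs for two tasks (rows: theta, columns: response 0 or 1).
em_bin_1 = np.array([[0.8, 0.2],
                     [0.1, 0.9]])
em_bin_2 = np.array([[0.7, 0.3],
                     [0.4, 0.6]])

belief = absorb_evidence(prior, em_bin_1, x=1)    # dock Task 1, absorb, discard
belief = absorb_evidence(belief, em_bin_2, x=0)   # dock Task 2, absorb, discard
print(belief)
```

Because responses to different tasks are conditionally independent given theta, processing the two tasks in either order, or both at once, gives the same posterior in this sketch.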

When all the student model variables and observable variables are discrete, the 
belief updating equations all have closed form (Lauritzen & Spiegelhalter, 1988). 
Complications arise when one wishes to assemble fragments on the fly, in ensuring
that a proper join tree can be secured for each concatenated BIN. Almond et al.
(1999) offer one solution to this problem: forcing edges in the student-model BIN
among student-model variables, which are parents of some observable in any
evidence model that may be used.

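Rarely are $\pi$ and $\lambda$ known with certainty. Fully Bayesian inference deals
with them and all the $\theta$s at once (Section 3.2.2). The modularity of SM-BINs and
EM-BINs that suits KBMC can be maintained by using facsimiles that replace $\pi$ and
$\lambda$ with point estimates $\hat{\pi}$ and $\hat{\lambda}$ (e.g., posterior means given $X_{old}$) or with marginal
approximations $p^*(\theta) = \int p(\theta \mid \lambda)\, p(\lambda \mid X_{old})\, d\lambda$ and
$p^*(x_{(s)ij} \mid \theta_i) = \int p(x_{(s)ij} \mid \theta_i, \pi_{(s)j})\, p(\pi_{(s)j} \mid X_{old})\, d\pi_{(s)j}$.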

3.2.2 Inference About Higher Level Parameters 

When an operational assessment program is initiated, responses from a large
sample of examinees may be used to draw sharp inferences about the parameters of
the population of examinees and a startup set of tasks. The inferential targets are $\lambda$,
$\eta$, and $\pi_{old}$, and the relevant posterior distribution is $p(\pi_{old}, \eta, \lambda \mid Y, X_{old})$. The results
of this analysis can be used to construct SM- and EM-BINs for use with future
examinees.

The details of such analyses have been worked out for special cases of familiar 
assessment practices, such as the IRT methodologies outlined in Section 4. Recent 
work with Markov Chain Monte Carlo (MCMC) estimation (e.g., Gelman et al.,
1995) provides a general approach that can be applied flexibly with new models as 
well, and suits the modular construction of probability distributions in KBMC. A 
full treatment of MCMC methods is beyond the current presentation. It suffices here 
to state the essential idea: to produce draws from a series of distributions 
constructed in the manner sketched below, which is equivalent in the limit to 
drawing from the posterior distribution of interest. 

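We address $p(\theta, \pi_{old}, \eta, \lambda \mid Y, X_{old})$ in the present problem using a Gibbs sampler.
Iteration $t+1$ starts with values for each of the parameters, say $\{\theta^t, \pi_{old}^t, \eta^t, \lambda^t\}$. A
value is then drawn from each of the following conditional distributions:

Draw $\theta^{t+1}$ from $p(\theta \mid \pi_{old}^t, \eta^t, \lambda^t, Y, X_{old})$;
draw $\pi_{old}^{t+1}$ from $p(\pi_{old} \mid \theta^{t+1}, \eta^t, \lambda^t, Y, X_{old})$;
draw $\eta^{t+1}$ from $p(\eta \mid \theta^{t+1}, \pi_{old}^{t+1}, \lambda^t, Y, X_{old})$; and
draw $\lambda^{t+1}$ from $p(\lambda \mid \theta^{t+1}, \pi_{old}^{t+1}, \eta^{t+1}, Y, X_{old})$.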

Once the process is stationary, the distribution of a large number of draws for a 
given parameter approximates its marginal distribution. Summaries such as 
posterior means and variances can be calculated, which may be used to construct 
self-contained SM- and EM-BIN fragments. We used the Spiegelhalter et al. (1996) 
BUGS program in the example in Section 5. See Gelman et al. (1995) on assessing 
convergence and criticizing model fit. 
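To make the cycle of conditional draws concrete, here is a minimal Gibbs sampler sketch in Python for a deliberately small case. It is not the BUGS analysis reported in Section 5. The assumptions are ours: a single binary student-model variable per examinee, $\lambda = P(\theta = 1)$ with a uniform prior, and Beta priors on the response probabilities whose hyperparameters stand in for $\eta$ and are held fixed. The function names and all numeric settings are hypothetical.

```python
import numpy as np

def simulate_responses(N=500, J=10, lam=0.6, seed=1):
    """Simulate 0/1 responses from a two-class model (illustrative values)."""
    rng = np.random.default_rng(seed)
    theta = (rng.random(N) < lam).astype(int)
    pi = np.vstack([np.full(J, 0.2), np.full(J, 0.8)])  # pi[k, j] = P(X_j = 1 | theta = k)
    X = (rng.random((N, J)) < pi[theta]).astype(int)
    return X, theta

def gibbs_latent_class(X, n_iter=600, burn_in=200, seed=0):
    """Gibbs sampler for theta (per examinee), pi (per item), and lambda."""
    rng = np.random.default_rng(seed)
    N, J = X.shape
    lam = 0.5
    pi = np.tile(np.array([[0.25], [0.75]]), (1, J))    # starting values
    # Assumed fixed Beta(a, b) priors: rows for theta = 0 and theta = 1.
    priors = [(2.0, 6.0), (6.0, 2.0)]
    lam_draws, pi_draws = [], []
    for t in range(n_iter):
        # Draw theta given pi, lambda, and the data.
        loglik = X @ np.log(pi).T + (1 - X) @ np.log(1 - pi).T   # shape (N, 2)
        logpost = loglik + np.log([1 - lam, lam])
        logpost -= logpost.max(axis=1, keepdims=True)
        prob = np.exp(logpost)
        p1 = prob[:, 1] / prob.sum(axis=1)
        theta = (rng.random(N) < p1).astype(int)
        # Draw pi given theta and the data (Beta-binomial conjugacy).
        for k, (a, b) in enumerate(priors):
            members = X[theta == k]
            right = members.sum(axis=0)
            wrong = members.shape[0] - right
            pi[k] = rng.beta(a + right, b + wrong)
        # Draw lambda given theta (uniform Beta(1, 1) prior).
        n1 = theta.sum()
        lam = rng.beta(1 + n1, 1 + N - n1)
        if t >= burn_in:
            lam_draws.append(lam)
            pi_draws.append(pi.copy())
    return float(np.mean(lam_draws)), np.mean(pi_draws, axis=0)
```

The posterior means returned here are the kind of summaries that could populate self-contained SM- and EM-BIN fragments; a real application would also check convergence rather than rely on a fixed burn-in.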

3.2.3 Inference About New Tasks 

Ongoing assessment programs continually add new tasks to the item pool, 
whether to help maintain security, to extend the range of skills addressed, or simply 
to provide variety for students. We assume that the new items are created in 
accordance with existing task models and conformable evidence models. We must 
estimate the $\pi$s for the EM-BINs of the new tasks.

Suppose we have already obtained responses $X_{old}$ from a sample of examinees
for a set of tasks $1, \ldots, J$, and by methods such as those described above obtained
posterior distributions $p(\lambda \mid X_{old})$, $p(\eta_{(s)} \mid X_{old})$ for $s = 1, \ldots, S$, and
$p(\pi_j \mid Y_j, X_{old})$ for $j = 1, \ldots, J$.
We wish to calibrate into the set a new Task $J+1$, which uses Evidence Model $s[J+1]$
and is characterized by task features $Y_{J+1}$. We obtain responses $X_{new}$ from a sample
of $N_{new}$ examinees to both Task $J+1$ and previously-calibrated tasks. The objective
now is to obtain an approximation of $p(\pi_{J+1} \mid Y_{J+1}, X_{old}, X_{new})$ that we can use to produce
the EM-BIN for Task $J+1$.

A first approach acknowledges remaining uncertainty about the parameters of
the old tasks and the examinee and task hyper distributions. Posterior distributions
from the startup estimation are employed as the priors for $\lambda$, $\eta$, and $\pi_{old}$. These are,
respectively, $p(\lambda \mid X_{old})$, $p(\eta \mid X_{old})$, and $p(\pi_{old} \mid Y_{old}, X_{old})$. The iterations in an MCMC
solution echo those of the startup estimation: One draws successively for $\lambda$, $\eta$, and
$\pi_{old}$ as well as for $\pi_{J+1}$ and $\theta_{new}$. In addition to posteriors for $\pi_{J+1}$ and $\theta_{new}$ based
on $X_{new}$, one obtains updated distributions for $\lambda$, $\eta$, and $\pi_{old}$ based now on both
$X_{old}$ and $X_{new}$.

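A second, simpler, approach treats the previous point estimates as known. The
probability model for this so-called "empirical Bayes" approximation is

$$p(X_{new}, \theta_{new}, \pi_{J+1} \mid \hat{\lambda}, \hat{\eta}, \hat{\pi}_1, \ldots, \hat{\pi}_J, Y_{J+1})$$
$$= p(\pi_{J+1} \mid \hat{\eta}_{s[J+1]}, Y_{J+1})\; p(X_{new} \mid \theta_{new}, \hat{\pi}_1, \ldots, \hat{\pi}_J, \pi_{J+1})\; p(\theta_{new} \mid \hat{\lambda})$$
$$= p(\pi_{J+1} \mid \hat{\eta}_{s[J+1]}, Y_{J+1}) \prod_{i=1}^{N_{new}} p(\theta_i \mid \hat{\lambda})\; p(x_{i,J+1} \mid \theta_i, \pi_{J+1}) \prod_{j=1}^{J} p(x_{ij} \mid \theta_i, \hat{\pi}_j).$$

MCMC estimation approximates the posterior

$$p(\theta_{new}, \pi_{J+1} \mid X_{new}, \hat{\lambda}, \hat{\eta}, \hat{\pi}_1, \ldots, \hat{\pi}_J, Y_{J+1})$$

with iterations of the following form:

Draw $\theta_{new}^{t+1}$ from $p(\theta_{new} \mid \pi_{J+1}^t, X_{new}, \hat{\lambda}, \hat{\eta}, \hat{\pi}_1, \ldots, \hat{\pi}_J, Y_{J+1})$;
draw $\pi_{J+1}^{t+1}$ from $p(\pi_{J+1} \mid \theta_{new}^{t+1}, X_{new}, \hat{\lambda}, \hat{\eta}, \hat{\pi}_1, \ldots, \hat{\pi}_J, Y_{J+1})$.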



This second approach is simpler because it treats parameters known only
partially, namely $\lambda$, $\eta$, and $\pi_{old}$, as if they were known with certainty. This
expedient can distort the resulting posterior for $\pi_{J+1}$, understating uncertainty and
possibly changing its shape or location. Just how tight the distributions for $\lambda$, $\eta$,
and $\pi_{old}$ must be for these distortions to be negligible is an empirical question, as
illustrated in Section 5.



4. Item Response Theory and Adaptive Testing 

This section discusses computerized adaptive testing (CAT) with item response 
theory (IRT). In CAT, the preceding ideas have been applied in large-scale operational
testing programs such as the Graduate Record Examination (GRE) and the
Armed Services Vocational Aptitude Battery (ASVAB). It is a good example because 
both the student model and the observations are fairly simple, and the 
methodologies have evolved over the past fifty years in the context of educational 
testing. 



4.1 Item Response Theory (IRT) 

An IRT model expresses an examinee's propensity to perform well in a domain
of test items, in terms of a single unobservable proficiency variable $\theta$. Item
responses are posited to be independent, conditional on $\theta$ and item parameters that
express characteristics such as items' difficulty or their sensitivity to proficiency.
The Rasch model for J dichotomous test items is an example:

$$p(x_1, \ldots, x_J \mid \theta, \beta_1, \ldots, \beta_J) = \prod_{j=1}^{J} P(x_j \mid \theta, \beta_j), \tag{5}$$

where $x_j$ is the response to Item j (1 for right, 0 for wrong), $\beta_j$ is the difficulty
parameter of Item j, and $P(x_j \mid \theta, \beta_j) = \exp[x_j(\theta - \beta_j)] / [1 + \exp(\theta - \beta_j)]$. The $\beta_j$s play
the role of the $\pi_j$s in the notation of Section 3.
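As a small numerical companion to the Rasch model, the following Python sketch evaluates the item-level probability and the product over items. The function names are our own, and the example values are illustrative only.

```python
import numpy as np

def rasch_prob(x, theta, beta):
    """P(x | theta, beta) = exp[x (theta - beta)] / [1 + exp(theta - beta)]."""
    return np.exp(x * (theta - beta)) / (1.0 + np.exp(theta - beta))

def rasch_likelihood(x, theta, betas):
    """Joint probability of a response vector: product over items."""
    x = np.asarray(x, dtype=float)
    betas = np.asarray(betas, dtype=float)
    return float(np.prod(rasch_prob(x, theta, betas)))

# An examinee whose proficiency equals an item's difficulty succeeds half the time.
print(rasch_prob(1, theta=0.0, beta=0.0))
```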

The student model in IRT contains the single proficiency variable $\theta$, and an
SM-BIN is just a probability distribution for $\theta$, initially $p(\theta)$. A task model
specifies a set of salient features of a class of items, or task model variables $Y_j$, that
concern content areas, cognitive demands, item format, work product specifications,
and so on, as required to assemble tests or model item parameters. An evidence
model contains the rules for determining the value of the response $X_j$ from an
examinee's work product, such as a rubric a rater uses to evaluate a free response or
a correct answer against which an examinee's multiple-choice response is compared.
An evidence model also specifies the structure of EM-BINs, which in this example
are identical in form but generally differ as to the value of $\beta_j$. The evidence model
may further posit a relationship between $\beta_j$s and $Y_j$s (see Section 4.3).

The likelihood function (5) corresponds to concatenated EM-BIN fragments. Once
an examinee's response vector $x = (x_1, \ldots, x_J)$ is observed, it is viewed as a likelihood
function for $\theta$, say $L(\theta \mid x, B)$. Bayesian inference is based on the posterior $p(\theta \mid x, B) \propto
L(\theta \mid x, B)\, p(\theta)$, where $B = (\beta_1, \ldots, \beta_J)$. Then $p(\theta \mid x, B)$ can be summarized by its posterior
mean $\bar{\theta}$ and variance $\mathrm{Var}(\theta \mid x, B)$.
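One simple way to compute these posterior summaries is numerical evaluation on a grid of proficiency values. The sketch below assumes a standard normal prior for the proficiency variable and a grid from -4 to 4; both are our illustrative choices, as is the function name `grid_posterior`.

```python
import numpy as np

def grid_posterior(x, betas, grid=None):
    """Posterior over a proficiency grid under the Rasch model.

    Returns the grid, the normalized posterior weights, the posterior mean,
    and the posterior variance. Assumes a N(0, 1) prior (an illustration).
    """
    x = np.asarray(x)
    betas = np.asarray(betas, dtype=float)
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 161)
    prior = np.exp(-0.5 * grid**2)                       # unnormalized N(0, 1)
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - betas[None, :])))
    like = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)  # likelihood at each grid point
    post = prior * like
    post /= post.sum()
    mean = float(np.sum(grid * post))
    var = float(np.sum((grid - mean) ** 2 * post))
    return grid, post, mean, var

# One right and one wrong answer on two items of difficulty 0.
_, _, mean, var = grid_posterior([1, 0], [0.0, 0.0])
print(mean, var)
```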



4.2 Inference About Examinees: CAT 

A fixed test form provides different accuracy for different values of $\theta$, with
greater precision when $\theta$ lies in the neighborhood of the items' difficulties. CAT
tailors the test's level of difficulty to each examinee. Testing proceeds sequentially,
with each successive item $k+1$ selected to be informative about the examinee's $\theta$ in
light of the responses to the first $k$ items, or $x^{(k)}$ (Wainer et al., 1990, Chap. 5). A
Bayesian approach to CAT starts from a prior distribution for $\theta$ and selects each
next item $j$ to minimize expected posterior variance, or
$E_{x_j}[\mathrm{Var}(\theta \mid x^{(k)}, x_j, B^{(k)}, \beta_j) \mid x^{(k)}, B^{(k)}]$. Additional constraints on item selection can be
incorporated into the assessment assembly algorithm, such as item content and
format encoded as task model variables $Y_j$ (Stocking & Swanson, 1993). Testing
ends when a desired measurement accuracy has been attained or a specified number
of items has been presented.
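The minimum expected posterior variance rule can be sketched in Python for the Rasch model with belief about proficiency held on a grid. This is a bare illustration under our own assumptions: the grid, the function names, and the item pool are hypothetical, and the content and format constraints mentioned above are omitted.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 161)   # assumed grid of proficiency values

def p_correct(theta, beta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def posterior_variance(post):
    """Variance of a normalized belief vector defined on GRID."""
    mean = np.sum(GRID * post)
    return float(np.sum((GRID - mean) ** 2 * post))

def expected_posterior_variance(post, beta):
    """Average the posterior variance over the two possible responses,
    weighting each by its predictive probability under current belief."""
    p1 = p_correct(GRID, beta)
    epv = 0.0
    for like in (p1, 1.0 - p1):            # response 1, then response 0
        pred = np.sum(post * like)         # predictive probability of the response
        updated = post * like / pred       # belief after that response
        epv += pred * posterior_variance(updated)
    return float(epv)

def select_next_item(post, pool_betas, used):
    """Index of the unused item that minimizes expected posterior variance."""
    candidates = [j for j in range(len(pool_betas)) if j not in used]
    return min(candidates,
               key=lambda j: expected_posterior_variance(post, pool_betas[j]))
```

With a belief centered at zero and a hypothetical pool of difficulties (-3, 0, 3), the rule picks the middle item, which matches the remark that precision is greatest when proficiency lies near an item's difficulty.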

Figure 3 depicts the SM-BIN and EM-BINs in IRT-CAT. Figure 3a shows the
SM-BIN on the left, consisting of the single SM variable $\theta$ and the distribution object
that contains current belief about its unknown value. On the right is a library of
EM-BINs, each linked to a particular task. The observable variable $X_j$ appears, along
with the distribution object that contains the IRT conditional distribution for $X_j$
given $\theta$. Figure 3b shows an EM-BIN "docked" with the SM-BIN to absorb
evidence in the form of a response to the corresponding item.




[Figure 3. SM-BIN and Task/EM-BINs in IRT-CAT. Panel a) SM-BIN and Task/EM-BIN
Library; panel b) EM-BIN for Item 2 "docked" with SM-BIN. The distribution object
for the SM-BIN contains the distribution for $\theta$; those for the tasks contain
the conditional distributions of the item response given $\theta$.]




4.3 Inference About Higher Level Parameters

For selecting items and scoring examinees in typical applications, estimates of 
the item parameters are obtained from large samples of examinee responses and 
treated as known. This procedure plays the role of the MCMC estimation described 
in Section 3.2.2. Bayes modal estimation and maximum likelihood (Bock & Aitkin, 
1981) are widely used, although MCMC methods are appearing (e.g., Albert, 1992). 

There is growing interest in exploiting collateral information about test item
features $Y_j$ to reduce the number of pretest examinees needed to estimate item

parameters (Mislevy, Sheehan, & Wingersky, 1993). For example, Scheuneman, 
Gerritz, and Embretson (1991) accounted for about 65% of the variance in item 
difficulties in the Reading section of the National Teacher Examination with 
variables for tasks' syntactic complexity, semantic content, cognitive demand, and 
knowledge demand. Fischer (1973) integrated cognitive information into IRT by 
modeling Rasch item difficulty parameters as linear functions of effects for item 
features. Incorporating a residual term to allow for variation of difficulties among 
items with the same features gives 

$$\beta_j = \sum_{k=1}^{K} Y_{kj}\, \eta_k + \varepsilon_j,$$

where $\eta_k$ is the contribution of Feature k to the difficulty of an item, $Y_{kj}$ is the extent
to which Feature k is represented in Item j, and $\varepsilon_j$ is a $N(0, \phi^2)$ residual term.
Sheehan and Mislevy (1990) used this model with item features based on cognitive 
analysis of the difficulty of document literacy tasks. 
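The linear model for item difficulties is easy to state in code. The following Python sketch computes the structural part of each difficulty from a feature matrix and, optionally, adds a normal residual; the function name, the feature codings, and the effect values are hypothetical and are not estimates from any of the studies cited.

```python
import numpy as np

def item_difficulties(Y, eta, phi=0.0, rng=None):
    """beta_j = sum_k Y[k, j] * eta[k] + eps_j, with eps_j ~ N(0, phi^2).

    Y   : K x J array, extent to which Feature k is represented in Item j
    eta : length-K array of feature contributions to difficulty
    phi : residual standard deviation (0 gives the structural part only)
    """
    Y = np.asarray(Y, dtype=float)
    eta = np.asarray(eta, dtype=float)
    structural = eta @ Y                       # shape (J,)
    if phi == 0.0:
        return structural
    if rng is None:
        rng = np.random.default_rng()
    return structural + rng.normal(0.0, phi, size=structural.shape)

# Two hypothetical features, three items.
Y = [[1, 0, 1],
     [0, 1, 1]]
eta = [0.5, -1.0]
print(item_difficulties(Y, eta))
```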

4.4 Inference About New Tasks 

CAT selects items according to their difficulty parameters in order to maximize
information about an examinee's $\theta$. To do this one must know something about the
$\beta_j$s. Now testing programs continually introduce new items into the item pool so
items do not become spuriously easy after examinees share them. Estimating the $\beta$s
of new items within the context of operational testing is called "on-line calibration."
This is usually done by administering examinees both optimally-determined items
whose $\beta$s are well-estimated and randomly-selected new items whose $\beta$s are not
known. The responses to the former are used to determine the examinee's
operational score, while the responses to the latter are used to learn about the new
items' $\beta$s. This is the situation discussed in Section 3.2.3. Standard practice is to
estimate the parameters of new items using the empirical Bayes approximation; that is,
the parameters of the "old" items are treated as known. Empirical studies have
shown this expedient yields satisfactory estimates for $B_{new}$. The evidentiary value of
$Y$s for $\beta$s can also be exploited in on-line calibration, in order to reduce the number
of pretest examinees that are needed; knowing that a vocabulary item tests a
common word, for example, gives it an initial prior distribution anticipating a
lower-than-average difficulty parameter.
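A grid-based Python sketch of on-line calibration under the empirical Bayes approximation follows. It is our illustration rather than any operational program's procedure: old item difficulties are treated as known, proficiency is given a standard normal prior, and the new item's difficulty receives a normal prior whose mean and standard deviation (`prior_mean`, `prior_sd`, hypothetical names) could be shifted to reflect task features.

```python
import numpy as np

THETA = np.linspace(-4.0, 4.0, 81)   # assumed grid for examinee proficiency
BETA = np.linspace(-4.0, 4.0, 81)    # assumed grid for the new item's difficulty

def p_correct(theta, beta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def theta_posterior(x_old, betas_old):
    """Belief about one examinee from responses to old, known-difficulty items."""
    x_old = np.asarray(x_old)
    betas_old = np.asarray(betas_old, dtype=float)
    prior = np.exp(-0.5 * THETA**2)
    p = p_correct(THETA[:, None], betas_old[None, :])
    like = np.prod(np.where(x_old == 1, p, 1.0 - p), axis=1)
    post = prior * like
    return post / post.sum()

def calibrate_new_item(X_old, betas_old, x_new, prior_mean=0.0, prior_sd=1.0):
    """Posterior over BETA for the new item, with old difficulties held fixed."""
    log_post = -0.5 * ((BETA - prior_mean) / prior_sd) ** 2
    p_new = p_correct(THETA[:, None], BETA[None, :])   # rows: theta, columns: beta
    for x_old, x in zip(X_old, x_new):
        w = theta_posterior(x_old, betas_old)
        marg1 = w @ p_new            # P(correct | beta), proficiency integrated out
        log_post += np.log(marg1 if x == 1 else 1.0 - marg1)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()
```

Setting `prior_mean` below zero would encode the expectation of a lower-than-average difficulty, as in the common-word vocabulary example.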

4.5 A Pointer to Factor Analysis 

Without working through the details, we note in passing how neatly another 
mainstay of psychometrics, factor analysis (Thurstone, 1947), falls into the structure 
outlined in Section 3. In the notation of Section 3, the basic equation of factor 
analysis is 

$$x_{ij} = \sum_k \pi_{jk}\, \theta_{ik} + e_{ij}, \tag{6}$$

where $x_{ij}$ is the observable test score of Examinee i on Test j; $\pi_{jk}$ is the loading
(regression coefficient) of Test j on the unobservable Factor k; $\theta_{ik}$ is Examinee i's
value on Factor k; and $e_{ij}$ is a residual, independent of $\theta$ and having variance
$\sigma_j^2$, the unique variance of Test j. Equation 6 implies that for standardized test
scores and factors,

$$\Sigma_x = \Pi\, \Sigma_\theta\, \Pi' + \mathrm{diag}(\sigma_1^2, \ldots, \sigma_J^2),$$

where $\Sigma_x$ and $\Sigma_\theta$ are the correlation matrices of the scores and factors, respectively,
and $\Pi$ is the matrix of loadings $\pi_{jk}$.
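The implied correlation structure can be checked numerically. In the Python sketch below the loadings are hypothetical, and the unique variances are obtained as one minus each test's communality, which is what standardization of the scores requires.

```python
import numpy as np

def implied_correlation(loadings, factor_corr):
    """Score correlation matrix implied by the factor model.

    loadings    : J x K matrix of loadings (tests by factors)
    factor_corr : K x K correlation matrix of the factors
    Returns the J x J implied correlation matrix and the unique variances.
    """
    loadings = np.asarray(loadings, dtype=float)
    factor_corr = np.asarray(factor_corr, dtype=float)
    common = loadings @ factor_corr @ loadings.T
    unique = 1.0 - np.diag(common)     # standardized scores have unit variance
    return common + np.diag(unique), unique

# Two tests loading on a single factor (illustrative values).
sigma_x, unique = implied_correlation([[0.8], [0.6]], [[1.0]])
print(sigma_x)
```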

Factor analysts were initially concerned with determining the number of
factors in a given problem and estimating the factor loadings, which is fundamentally the
problem discussed in Section 3.2.2. Issues of resolving indeterminacies among factor
solutions and of distinguishing exploratory and confirmatory analyses can be
viewed as issues of specifying prior distributions for $\pi$s and $\sigma_j^2$s (Scheines, Hoijtink,
& Boomsma, 1999). Once a solution is accepted, what can be said about a particular
examinee's factor values given her test scores? Factor score estimation (Cattell, 1978,
Chap. 11) addresses this question, which is the problem of Section 3.2.1. And if the factor
loadings of a set of tests have been estimated from one data set, can loadings for
additional tests on the same factors be obtained from new examinees' scores on both
the original tests and the new ones? Dwyer (1937) answered in the affirmative by
introducing "extension loadings," in essence the problem discussed in Section 3.2.3.

5. A Multivariate Latent Class Model 

This section concerns binary skills latent class models (Haertel, 1984). We give 
numerical results from analyses of Tatsuoka's (1990) data on mixed number 
subtraction with middle school students. 

5.1 Binary Skills Models 

In a binary skills model, the student model contains a vector of K 0/1 variables
$\theta_i = (\theta_{i1}, \ldots, \theta_{iK})$, each of which signifies that an examinee either does (1) or does not
(0) possess some particular element of skill or knowledge in some learning domain.
A task in this domain is similarly characterized by a vector of K 0/1 task model
variables $Y_j = (Y_{j1}, \ldots, Y_{jK})$ that indicates whether a task does (1) or does not (0)
require each of these skills for successful solution; these values are known with
certainty, and are determined by the features of the task's construction and the skills
that theory says are required to solve it in light of those features. The statistical
component of the evidence model posits that an examinee is likely to succeed on a
task ($X_j = 1$) when she possesses the skills it demands, and likely to fail ($X_j = 0$) if
she lacks one or more of them.
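The conjunctive logic of a binary skills model can be sketched in a few lines of Python. The "true positive" and "false positive" probabilities below (0.9 and 0.1) are placeholders chosen for illustration, not values estimated from the data, and the function names are our own.

```python
import numpy as np

def has_required_skills(theta, y):
    """True when the examinee possesses every skill the task requires."""
    theta = np.asarray(theta)
    y = np.asarray(y)
    return bool(np.all(theta[y == 1] == 1))

def p_success(theta, y, true_pos=0.9, false_pos=0.1):
    """P(X_j = 1 | theta): high with all required skills, low otherwise."""
    return true_pos if has_required_skills(theta, y) else false_pos

# A task requiring skills 1 and 3 of five (hypothetical requirement vector).
y = [1, 0, 1, 0, 0]
print(p_success([1, 1, 1, 0, 0], y))   # has both required skills
print(p_success([1, 1, 0, 1, 1], y))   # lacks skill 3
```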

5.2 The Method B Network 

This example is grounded in a cognitive analysis of middle-school students' 
solutions of mixed-number subtraction problems. Klein et al. (1981) identified two 
methods of solution: 

Method A: Convert mixed numbers to improper fractions, subtract, then 
reduce if necessary. 

Method B: Separate mixed numbers into whole number and fractional parts, 
subtract as two subproblems, borrowing one from the minuend whole number if
necessary, then simplify and reduce if necessary. 

We focus on students learning to use Method B. The cognitive analysis mapped out 
a flowchart for applying Method B to a universe of fraction subtraction problems. A 
number of key procedures appear, which a given problem may or may not require. 
Students had trouble solving a problem with Method B when they could not carry
out one or more of the procedures an item required. Instruction was available to
review each procedure. The purpose of the test in this example was to determine
which procedures a student should review, among five procedures that are
sufficient for mixed-number subtraction problems when no common denominator
needs to be found. The procedures are defined at the grain-size of the review
lessons; they are as follows:

Skill 1: Basic fraction subtraction. 

Skill 2: Simplify/reduce fraction or mixed number. 

Skill 3: Separate whole number from fraction. 

Skill 4: Borrow one from the whole number in a given mixed number. 

Skill 5: Convert a whole number to a fraction. 

$\theta_1, \ldots, \theta_5$ are student-model variables that correspond to having or not having 
each of these skills, with the idea that a student with a low probability of having a 
skill would benefit from the corresponding review session. Prior analyses revealed 
that Skill 3 is a prerequisite to Skill 4. We introduced a three-level variable, $\theta_{WN}$, that 
incorporates this constraint. Level 0 of $\theta_{WN}$ means having neither of these skills; 
Level 1 means having Skill 3 but not Skill 4; Level 2 means having both of them. 

Table 1 lists fifteen items from Dr. Tatsuoka's data set, characterized by the 
skills they require, i.e., their $Y$s. The list is grouped by patterns of skill 
requirements. All the items in a group have the same structural relationship to $\theta$: 
they require that a student have the same conjunction of skills in order to make a "true 
positive" correct response. They accord with the same evidence model, and will 
have EM-BIN fragments with the same graphical model. 

We re-analyze data that Dr. Tatsuoka collected and analyzed with her Rule- 
Space methodology, which also used a binary skills foundation but with a somewhat 
different set of skills and a pattern-matching approach to handling uncertainty. We 
consider the responses of 325 students deemed to be using Method B. 

5.3 The Probability Model 

The full probability distribution for all 325 examinees and 15 items has the form 
shown in (4). The distributions are specified as follows. 
Table 1 

Skill Requirements for Fraction Items 

The student model variables are $(\theta_1, \ldots, \theta_5, \theta_{WN})$. Preliminary analyses based on 
point estimates from Tatsuoka's analysis led us to the structure depicted in Figure 4. 
Edges represent conditional dependence relationships, with directions chosen 
according to the usual instructional order. Recalling that each of the variables $\theta_k$ is 
binary and $\theta_{WN}$ has three levels, we may describe the SM-BIN, or $p(\theta \mid \lambda)$, as follows: 

$\theta_1$ is Bernoulli with probability $\lambda_1$; that is, $\theta_1 \sim \mathrm{Bern}(\lambda_1)$. 

$\theta_2$ depends on $\theta_1$: $\theta_2 \mid \theta_1 = z \sim \mathrm{Bern}(\lambda_{2z})$ for $z = 0, 1$. That is, there may be 
different probabilities of having Skill 2 depending on whether a student 
does or does not have Skill 1; those probabilities are $\lambda_{20}$ and $\lambda_{21}$ 
respectively. 

$\theta_5$ depends on $\theta_1$ and $\theta_2$: $\theta_5 \mid (\theta_1 + \theta_2 = z) \sim \mathrm{Bern}(\lambda_{5z})$ for $z = 0, 1, 2$. That is, there 
may be different probabilities of having Skill 5 depending on whether a 
student has Skills 1 and 2; we allow for different probabilities depending 
on how many of them the student has: $\lambda_{50}$ if neither, $\lambda_{51}$ if just one of 
them, and $\lambda_{52}$ if both. 

$\theta_{WN}$ can take values 0, 1, 2; it depends on $\theta_1$, $\theta_2$, and $\theta_5$: 
$\theta_{WN} \mid (\theta_1 + \theta_2 + \theta_5 = z) \sim \mathrm{Cat}(\lambda_{WN,z0}, \lambda_{WN,z1}, \lambda_{WN,z2})$ for $z = 0, 1, 2, 3$. As above, the 
probabilities for $\theta_{WN}$ are modeled as depending on other skills, and only 
the count of those mastered is distinguished. 

$\theta_3 = 0$ if $\theta_{WN} = 0$; $\theta_3 = 1$ if $\theta_{WN} = 1$ or 2. 

$\theta_4 = 0$ if $\theta_{WN} = 0$ or 1; $\theta_4 = 1$ if $\theta_{WN} = 2$. 

The last two of these relationships are logical rather than probabilistic, effecting the 
prerequisite relationship between $\theta_3$ and $\theta_4$. 
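
To make the structure of the SM-BIN concrete, the following Python sketch draws skill profiles by sampling each variable in turn given its parents. The numerical values in `LAMBDA` are hypothetical, chosen only for illustration; they are not estimates from this study, and the dictionary layout is an assumption of the sketch.

```python
import random

# Hypothetical lambda values, for illustration only (not estimates from the data).
LAMBDA = {
    "skill1": 0.8,                         # P(theta1 = 1)
    "skill2": {0: 0.2, 1: 0.9},            # P(theta2 = 1 | theta1 = z)
    "skill5": {0: 0.1, 1: 0.3, 2: 0.6},    # P(theta5 = 1 | theta1 + theta2 = z)
    "wn": {                                # P(theta_WN = 0, 1, 2 | theta1 + theta2 + theta5 = z)
        0: (0.7, 0.2, 0.1),
        1: (0.4, 0.4, 0.2),
        2: (0.2, 0.4, 0.4),
        3: (0.1, 0.3, 0.6),
    },
}


def sample_profile(lam, rng):
    """Draw (theta1, ..., theta5) by sampling each node given its parents."""
    t1 = int(rng.random() < lam["skill1"])
    t2 = int(rng.random() < lam["skill2"][t1])
    t5 = int(rng.random() < lam["skill5"][t1 + t2])
    wn = rng.choices((0, 1, 2), weights=lam["wn"][t1 + t2 + t5])[0]
    t3 = int(wn >= 1)   # logical relationship: Skill 3 iff theta_WN is 1 or 2
    t4 = int(wn == 2)   # logical relationship: Skill 4 iff theta_WN is 2
    return (t1, t2, t3, t4, t5)


rng = random.Random(0)
profiles = [sample_profile(LAMBDA, rng) for _ in range(1000)]
```

Because $\theta_3$ and $\theta_4$ are deterministic functions of $\theta_{WN}$, no sampled profile can have Skill 4 without Skill 3, which is how the three-level variable enforces the prerequisite constraint.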




Figure 4. DAG for student model for mixed number 
subtraction. Squares represent student-model variables; 
rounded rectangles represent distribution objects. 



We specified, for each $\lambda$, a prior distribution with an effective sample size of 
25. These are Beta($\alpha$, $\beta$) for the $\lambda$s that are parameters of Bernoulli distributions, 
with $\alpha = 21$ and $\beta = 6$ when the probability is expected to be high (e.g., students who 
have Skill 1 are likely to have Skill 2) and vice versa when the probabilities are 
expected to be low (students who don't have Skill 1 probably don't have Skill 2 
either). We used Dirichlet priors for the $\lambda_{WN}$ vectors, positing increasing belief of 
having Skills 3 and 4 as a student has more of Skills 1, 2, and 5. 

Evidence models correspond to patterns of $\theta_1, \ldots, \theta_5$ that are required to solve a 
class of items, at least one of which appears in the 15-item data set. There are six 
such patterns, which can be described either in terms of the vector of skills required 
or equivalently by the pattern of Task Model variables $Y$ of items that conform with 
that evidence model. The evidence models and the items that use them can be read 
from Table 1. For example, Evidence Model 3 is characterized by $Y = (1,0,1,0,0)$, and 
Items 4-6 accord with it. 

The EM-BINs take the form of misclassification matrices, specified by a false 
positive probability $\pi_{j0}$ of a correct response if the examinee does not have the 
conjunction of skills associated with the evidence model Task $j$ uses, and a true 
positive probability $\pi_{j1}$ of a correct response if she does. We denote by $\delta_{i(s)}$ whether 
Examinee $i$ has the skills needed for tasks using evidence model $s$; it takes the value 
1 if she does and 0 if she does not. 

The EM-BIN for Task $j$, which uses evidence model $s$, contains the observable 
response $X_{ij}$, pointers to the student model variables for which $Y_{jk} = 1$, and the 
following conditional probability distributions: 

$$X_{ij} \mid \delta_{i(s)} = z \sim \mathrm{Bern}(\pi_{jz}), \quad \text{for } z = 0, 1.$$ 

That is, the probability of a correct response, or $X_{ij} = 1$, follows a Bernoulli 
distribution, with probability parameter $\pi_{j1}$ if Student $i$ does have the required skills 
and $\pi_{j0}$ if she does not. These conditional probabilities are allowed to differ from 
item to item, both within and across evidence models. Figure 5 shows the structure 
of EM-BINs for $s = 2$ and 4. 



For priors for the $\pi$s, we again imposed Beta distributions with effective 
sample sizes of 25. These are Beta(21,6) for the $\pi_{j1}$s, or true positives, and Beta(6,21) for 
the $\pi_{j0}$s, or false positives. This corresponds to the prior expectation that students who 
do have the necessary skills will answer an item correctly about .8 of the time, and 
students who don't will answer correctly only about .2 of the time. These priors are 
just initial guesses. We expect, and indeed observe, substantial changes from the 
priors in the posterior means. 
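
As a quick check on these priors, the worked calculation below is an added illustration rather than part of the original analysis. It assumes one reading of "effective sample size of 25": that Beta($\alpha$, $\beta$) is treated as a uniform Beta(1,1) prior updated by $\alpha - 1$ successes and $\beta - 1$ failures. Under that reading both the mode and the mean are "about .8" and "about .2" respectively.

```latex
\begin{align*}
\mathrm{Beta}(21,6):\quad & (\alpha-1)+(\beta-1) = 20 + 5 = 25, \\
& \text{mode} = \frac{\alpha-1}{\alpha+\beta-2} = \frac{20}{25} = .80, \qquad
  \text{mean} = \frac{\alpha}{\alpha+\beta} = \frac{21}{27} \approx .78; \\
\mathrm{Beta}(6,21):\quad & \text{mode} = \frac{5}{25} = .20, \qquad
  \text{mean} = \frac{6}{27} \approx .22.
\end{align*}
```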








Figure 5. EM-BIN structures for tasks using Evidence 
Models 2 and 4. The distribution object represents the distribution 
of response $X_j$ given values of student-model parents 
indicated by pointers to student-model variables. 



5.4 Inference About Examinees 



In an operational assessment, inference about an individual examinee starts 
with the possibly diffuse population prior distribution, i.e., the SM-BIN initialized 
at $p(\theta \mid \hat{\lambda})$ or at $p(\theta \mid X_{old}) = \int p(\theta \mid \lambda)\, p(\lambda \mid X_{old})\, d\lambda$, depending on the approximation being 
used. EM-BINs for the items to which responses are observed are joined with the 
SM-BIN, and evidence is absorbed into the SM-BIN (Mislevy, 1995). 
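
For a small network, absorbing evidence amounts to Bayes' rule over all skill profiles. The Python sketch below does this by brute-force enumeration rather than by the propagation algorithms a Bayes net program would use. The two-skill student model, the three tasks, and their probabilities are hypothetical, introduced only to illustrate the calculation.

```python
from itertools import product


def update_beliefs(prior, tasks, responses):
    """Posterior over skill profiles, plus marginal mastery probabilities.

    prior:     dict mapping a skill profile (tuple of 0/1) to its probability
    tasks:     list of (required_skills, false_pos, true_pos) triples
    responses: list of 0/1 observed responses, one per task
    """
    posterior = {}
    for theta, p in prior.items():
        likelihood = 1.0
        for (required, false_pos, true_pos), x in zip(tasks, responses):
            has_all = all(t == 1 for t, r in zip(theta, required) if r == 1)
            p_correct = true_pos if has_all else false_pos
            likelihood *= p_correct if x == 1 else 1.0 - p_correct
        posterior[theta] = p * likelihood
    total = sum(posterior.values())
    posterior = {theta: v / total for theta, v in posterior.items()}
    n_skills = len(next(iter(prior)))
    marginals = [sum(v for theta, v in posterior.items() if theta[k] == 1)
                 for k in range(n_skills)]
    return posterior, marginals


# Hypothetical example: uniform prior over the four profiles of two skills.
prior = {theta: 0.25 for theta in product((0, 1), repeat=2)}
tasks = [((1, 0), 0.2, 0.8),   # requires Skill 1 only
         ((0, 1), 0.2, 0.8),   # requires Skill 2 only
         ((1, 1), 0.2, 0.8)]   # requires both skills
posterior, marginals = update_beliefs(prior, tasks, responses=[1, 0, 0])
print([round(m, 3) for m in marginals])
```

With a correct answer on the Skill 1 task and incorrect answers on the two tasks involving Skill 2, belief in Skill 1 rises above its prior of .5 and belief in Skill 2 falls well below it. Enumeration grows exponentially with the number of skills, which is one reason operational work relies on Bayes net software instead.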



Table 2 gives an illustration from the present example. The values of the $\lambda$s 
and $\pi$s were fixed at the posterior means of the first run in the following section, 
and Bayes net calculations were carried out with the ERGO computer program 
(Noetic Systems, 1991). We see how beliefs change after observing an examinee 
give mostly correct answers to items requiring skills other than Skill 2, but not to those 
that do require it. Comparing the base-rate and the updated probabilities shows substantial 
shifts toward the belief that this examinee has Skills 1, 3, 4, and possibly 5, but almost 
certainly not Skill 2. 



Table 2 

Profile of Skill-Mastery for X = (1,1,0,1,1,0,1,1,0,1,1,1,0,1,0) 

Skill   Prior probability   Posterior probability 
1       .883                .999 
2       .618                .056 
3       .937                .995 
4       .406                .702 
5       .355                .561 






5.5 Inference About Higher-Level Parameters 

As a baseline against which to compare subsequent runs that better mirror 
operational work, we used BUGS to estimate the full probability model from Section 
5.3 with all 15 items and all 325 examinees. Table 3 gives summary statistics from 
this run for selected parameters. The posterior means and standard deviations of the 
parameter estimates appear, along with method-of-moments estimates of Beta 
distributions these posteriors imply. Recalling the priors were Beta distributions 
with an effective weight of 25 observations, the column labeled n approximates the 
effective number of observations the data was worth in estimating each parameter. 
They are always less than the actual sample size of 325, since examinees' actual skill 
vectors are not known with certainty. 
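
The kind of sampling that BUGS automates can be illustrated with a hand-written Gibbs sampler for a much simpler model. The sketch below is not the model estimated here: it assumes a single binary skill, simulated data, and a uniform prior on $\lambda$, all of which are simplifications introduced for illustration. It keeps the Beta(21,6) and Beta(6,21) priors on the true-positive and false-positive probabilities.

```python
import random


def simulate(n, n_items, lam, fp, tp, seed=1):
    """Simulate 0/1 responses from a one-skill latent class model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        p = tp if rng.random() < lam else fp
        data.append([int(rng.random() < p) for _ in range(n_items)])
    return data


def gibbs_one_skill(X, n_iter=400, burn_in=100, seed=0):
    """Return posterior means of lam, pi0[j] (false pos.), and pi1[j] (true pos.)."""
    rng = random.Random(seed)
    n, J = len(X), len(X[0])
    lam = 0.5
    pi0 = [6 / 27] * J     # start at the Beta(6, 21) prior mean
    pi1 = [21 / 27] * J    # start at the Beta(21, 6) prior mean
    sum_lam, sum_pi0, sum_pi1, kept = 0.0, [0.0] * J, [0.0] * J, 0
    for it in range(n_iter):
        # 1. Draw each examinee's skill indicator given current parameters.
        theta = []
        for x in X:
            w1, w0 = lam, 1.0 - lam
            for j in range(J):
                w1 *= pi1[j] if x[j] else 1.0 - pi1[j]
                w0 *= pi0[j] if x[j] else 1.0 - pi0[j]
            theta.append(int(rng.random() < w1 / (w1 + w0)))
        # 2. Draw lam from its Beta full conditional (uniform prior assumed).
        n1 = sum(theta)
        lam = rng.betavariate(1 + n1, 1 + n - n1)
        # 3. Draw each item's parameters from their Beta full conditionals.
        for j in range(J):
            c1 = sum(x[j] for x, t in zip(X, theta) if t == 1)
            c0 = sum(x[j] for x, t in zip(X, theta) if t == 0)
            pi1[j] = rng.betavariate(21 + c1, 6 + n1 - c1)
            pi0[j] = rng.betavariate(6 + c0, 21 + (n - n1) - c0)
        if it >= burn_in:
            kept += 1
            sum_lam += lam
            for j in range(J):
                sum_pi0[j] += pi0[j]
                sum_pi1[j] += pi1[j]
    return (sum_lam / kept,
            [s / kept for s in sum_pi0],
            [s / kept for s in sum_pi1])


X = simulate(n=200, n_items=5, lam=0.6, fp=0.15, tp=0.9)
lam_hat, pi0_hat, pi1_hat = gibbs_one_skill(X)
```

Each sweep alternates between imputing the unobserved skill indicators and redrawing the parameters given those imputations. That the indicators are imputed rather than observed is why the data are worth fewer effective observations than the nominal sample size.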



Table 3 

MCMC Estimation, All Tasks, 325 Examinees 

Parameter      State             Mean   SD    $\alpha$   $\beta$   n 
$\lambda_1$                      .81    .02   204        49        226 
$\lambda_2$    $\theta_1 = 0$    .21    .07   11         23        8 
               $\theta_1 = 1$    .90    .03   134        11        118 
$\pi_4$        False Positive    .19    .05   12         51        37 
               True Positive     .92    .02   193        16        182 
$\pi_5$        False Positive    .20    .04   16         63        52 
               True Positive     .91    .02   173        18        164 
$\pi_8$        False Positive    .09    .02   20         211       204 
               True Positive     .87    .03   114        17        104 
$\pi_{10}$     False Positive    .04    .01   9          199       181 
               True Positive     .81    .03   109        26        108 
$\pi_{12}$     False Positive    .18    .03   38         169       180 
               True Positive     .75    .04   109        36        118 
$\pi_{14}$     False Positive    .05    .01   12         218       203 
               True Positive     .68    .04   90         42        106 



Table 4 reflects a startup run in an operational testing program. Two hundred 
twenty-five of the examinees were sampled, and parameters were estimated in 
BUGS for the $\lambda$s and for the $\pi$s of 12 items. This run establishes the statistical 
framework for subsequent inferences about new examinees and new items. The 
rows with values show posterior means similar to those of the baseline run, but 
slightly higher standard deviations. Translated to approximate Beta distributions, 
they show proportionally lower effective sample sizes. The blank rows correspond 
to the 3 items not administered; they are the "new" items to which we now turn our 
attention. 



Table 4 

MCMC Estimation, 12 Tasks, 225 Examinees 

Parameter      State             Mean   SD    $\alpha$   $\beta$   n 
$\lambda_1$                      .80    .03   144        37        154 
$\lambda_2$    $\theta_1 = 0$    .23    .08   6          21        1 
               $\theta_1 = 1$    .90    .03   96         10        80 
$\pi_4$        False Positive    .15    .05   8          42        23 
               True Positive     .92    .02   135        11        119 
$\pi_5$        False Positive    --     --    --         --        -- 
               True Positive     --     --    --         --        -- 
$\pi_8$        False Positive    .10    .02   17         155       145 
               True Positive     .83    .04   65         14        52 
$\pi_{10}$     False Positive    --     --    --         --        -- 
               True Positive     --     --    --         --        -- 
$\pi_{12}$     False Positive    .16    .03   23         121       117 
               True Positive     .74    .04   75         27        74 
$\pi_{14}$     False Positive    --     --    --         --        -- 
               True Positive     --     --    --         --        -- 
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5.6 Inference About New Tasks 

We carried out two BUGS runs to calibrate the three new items into the 
assessment, each reflecting one of the on-line calibration strategies outlined in 
Section 3.2.3. The response data for both runs are the same: responses to all 15 items 
from the 100 examinees not used in the setup run. 

Table 5 summarizes the results from a Bayesian approximation in which the $\lambda$s 
and the $\pi$s about which evidence was obtained in the first run are started with Beta 
or Dirichlet priors that reflect the posteriors from the setup run, via the method of 
moments approximations. For these parameters, the resulting posteriors agree well 
with the results from the 325-examinee baseline run; they are based on the same 
examinees, although the responses to the three new items from the 225-examinee 
startup sample are not included. The posteriors for the three new items, 
correspondingly, do not match quite as closely and translate to lower effective 
sample sizes. 
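
The method-of-moments step can be sketched as follows: a posterior mean and standard deviation are matched to the Beta distribution with the same two moments. The function names are illustrative. The formula for n, $(\alpha + \beta - 2) - 25$, is an inference from the tabled values rather than something stated in the text; it treats the prior as contributing 25 observations on top of a uniform distribution.

```python
from math import sqrt


def beta_moments(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


def beta_from_moments(mean, sd):
    """Beta(a, b) parameters matching a given mean and standard deviation."""
    k = mean * (1.0 - mean) / sd ** 2 - 1.0   # k equals a + b
    return mean * k, (1.0 - mean) * k


def effective_n(a, b, prior_obs=25):
    """Observations attributed to the data (assumed reading of the n column)."""
    return (a + b - 2) - prior_obs


# Round trip using the Table 3 row for lambda_2 given theta_1 = 1
# (alpha = 134, beta = 11, n = 118).
mean, sd = beta_moments(134, 11)
a, b = beta_from_moments(mean, sd)
n = effective_n(134, 11)
```

Applying `beta_from_moments` to the rounded means and standard deviations printed in the tables will not reproduce the tabled $\alpha$ and $\beta$ exactly, since a standard deviation reported as .02 may be anywhere between .015 and .025.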



Table 5 

Three New Tasks, 100 Examinees, Priors From Previous Run 

Parameter      State             Mean   SD    $\alpha$   $\beta$   n 
$\lambda_1$                      .81    .02   205        49        226 
$\lambda_2$    $\theta_1 = 0$    .22    .08   11         21        5 
               $\theta_1 = 1$    .90    .03   134        13        119 
$\pi_4$        False Positive    .19    .05   11         47        31 
               True Positive     .94    .02   192        12        177 
$\pi_5$        False Positive    .27    .07   11         30        14 
               True Positive     .89    .03   79         10        62 
$\pi_8$        False Positive    .08    .02   19         209       201 
               True Positive     .85    .03   95         17        85 
$\pi_{10}$     False Positive    .09    .03   8          79        59 
               True Positive     .79    .05   49         13        35 
$\pi_{12}$     False Positive    .17    .03   35         173       181 
               True Positive     .75    .04   110        38        121 
$\pi_{14}$     False Positive    .07    .03   6          75        53 
               True Positive     .68    .06   43         20        36 



Table 6 summarizes the results from the empirical Bayes approximation, in 
which the $\lambda$s and the $\pi$s about which evidence was obtained in the first run are 
fixed at the posterior means obtained in the setup run. The only parameters 
involved in the MCMC iterations were the 100 new examinees' $\theta$s and the 3 new 
items' $\pi$s. We see that the posterior means for the new items agree almost exactly 
with those of the preceding Bayesian solution. The effective sample sizes are greater 
by about 3 on the average, which represents the impact of treating the $\lambda$s and the 
$\pi$s from the previous run as "known" rather than "less uncertain than they were." 
This modest overstatement of precision would seem acceptable in practical work. 



Table 6 

Three New Tasks, 100 Examinees, Priors Fixed at Posterior Means 
From Previous Run 

Parameter      State             Mean   SD    $\alpha$   $\beta$   n 
$\lambda_1$                      --     --    --         --        -- 
$\lambda_2$    $\theta_1 = 0$    --     --    --         --        -- 
               $\theta_1 = 1$    --     --    --         --        -- 
$\pi_4$        False Positive    --     --    --         --        -- 
               True Positive     --     --    --         --        -- 
$\pi_5$        False Positive    .27    .07   12         33        17 
               True Positive     .89    .03   81         10        64 
$\pi_8$        False Positive    --     --    --         --        -- 
               True Positive     --     --    --         --        -- 
$\pi_{10}$     False Positive    .09    .03   8          80        61 
               True Positive     .80    .05   48         12        34 
$\pi_{12}$     False Positive    --     --    --         --        -- 
               True Positive     --     --    --         --        -- 
$\pi_{14}$     False Positive    .07    .03   6          78        57 
               True Positive     .68    .06   46         21        41 



Next Steps 

There are several fronts along which further work is needed. In an applied 
project, we are currently applying the approach illustrated in Section 5 to a 
simulation-based assessment of problem-solving in biology. We are considering 
alternative ways of joining SM- and EM-BINs that produce approximations in the 
SM-BIN posteriors, trading off exactitude for flexibility in larger problems. We also 
plan to develop templates for EM-BIN probability distributions that formally 
incorporate cognitively-relevant task model variables into response models (e.g., 
Wang, Wilson, & Adams, 1997). The most important lesson we have learned so far is 
the need for coordination across specialties in the design of complex assessments. 
An assessment that pushes the frontiers of psychology, technology, statistics, and a 
substantive domain all at once cannot succeed unless all are incorporated into a 
coherent design from the very beginning of the work. 
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